Abstract: This report documents the design and implementation of a
binocular, foveated active vision system as part of the Cog project at
the MIT Artificial Intelligence Laboratory. The active vision system
features a 3 degree of freedom mechanical platform that supports four
color cameras, a motion control system, and a parallel network of
digital signal processors for image processing. To demonstrate the
capabilities of the system, we present results from four sample
visual-motor tasks.
1 Introduction


The Cog Project at the MIT Artificial Intelligence Laboratory has
focused on the construction of an upper-torso humanoid robot, called
Cog, to explore the hypothesis that human-like intelligence requires
human-like interactions with the world (Brooks & Stein 1994). Cog has
sensory and motor systems that mimic human capabilities, including over
twenty-one degrees of freedom and a variety of sensory systems,
including visual, auditory, proprioceptive, tactile, and vestibular
senses. This paper documents the design and implementation of a
binocular, foveated active vision system for Cog.

In designing a visual system for Cog, we desire a system that closely
mimics the sensory and sensori-motor capabilities of the human visual
system. Our system should be able to detect stimuli that humans find
relevant, should be able to respond to stimuli in a human-like manner,
and should have a roughly anthropomorphic appearance. This paper details
the design decisions necessary to balance the need for human-like visual
capabilities with the reality of relying on current technology in
optics, imaging, and motor control, as well as with factors such as
reliability, cost, and availability.



Three similar implementations of the active vision system described
here were produced. The first, shown in Figure 1, is now part of the
robot Cog. The second and third implementations, one of which is shown
in Figure 2, were constructed as desktop development platforms for
active vision experiments.

Figure 1: Cog, an upper-torso humanoid robot.

Figure 2: One of the two desktop active vision platforms.

The next section describes the requirements of the active vision
system. Sections 3, 4, 5, and 6 provide the details of the camera
system, mechanical structure, motion control system, and image
processing system used in our implementation. To demonstrate the
capabilities of the system, we present four sample visual-motor tasks
in Section 7.

2 System Requirements

The active vision system for our humanoid robot should mimic the human
visual system while remaining easy to construct, easy to maintain, and
simple to control. The system should allow for simple visual-motor
behaviors, such as tracking and saccades to salient stimuli, as well as
more complex visual tasks such as hand-eye coordination, gesture
identification, and motion detection.

While current technology does not allow us to exactly mimic all of the
properties of the human visual system, there are two properties that we
desire: wide field of view and high acuity. Wide field of view is
necessary for detecting salient objects in the environment, providing
visual context, and compensating for ego-motion. High acuity is
necessary for tasks like gesture identification, face recognition, and
guiding fine motor movements. In a system of limited resources (limited
photoreceptors), a balance must be achieved between providing wide field
of view and high acuity. In the human retina, this balance results from
an unequal distribution of photoreceptors, as shown in Figure 3. A
high-acuity central area, called the fovea, is surrounded by a wide
periphery of lower acuity. Our active vision system will also need to
balance the need for high acuity with the need for wide peripheral
vision.

We also require that our system be capable of performing human-like eye
movements. Human eye movements can be classified into five categories:
three voluntary movements (saccades, smooth pursuit, and vergence) and
two involuntary movements (the vestibulo-ocular reflex and the
optokinetic response) (Kandel, Schwartz & Jessell 1992). Saccades focus
an object on the fovea through an extremely rapid ballistic change in
position (up to 900° per second). Smooth pursuit movements maintain the
image of a moving object on the fovea at speeds below 100° per second.
Vergence movements adjust the eyes for viewing objects at varying
depth. While the recovery of absolute depth may not be strictly
necessary, relative disparity between objects is critical for tasks such
as accurate hand-eye coordination, figure-ground discrimination, and
collision detection. The vestibulo-ocular reflex and the optokinetic
response cooperate to stabilize the eyes when the head moves.

Figure 3: Density of retinal photoreceptors as a function of location.
Visual acuity is greatest in the fovea, a very small area at the center
of the visual field. A discontinuity occurs where axons that form the
optic nerve crowd out photoreceptor cell bodies, resulting in a blind
spot. From (Graham 1965).

The goal of mimicking human eye movements generates a number of
requirements for our vision system. Saccadic movements provide a strong
constraint on the design of our system because of the high velocities
necessary. To obtain high velocities, our system must be lightweight,
compact, and efficient. Smooth tracking motions require high accuracy
from our motor control system, and a computational system capable of
real-time image processing. Vergence requires a binocular system with an
independent vertical axis of rotation for each eye. The vestibulo-ocular
reflex requires low-latency responses and high accuracy movements, but
these requirements are met by any system capable of smooth pursuit. The
optokinetic response places the least demanding requirements on our
system; it requires only basic image processing techniques and slow
compensatory movements.¹

¹ Implementations of these two reflexes are currently in progress for
Cog (Peskin & Scassellati 1997). The desktop development platforms have
no head motion and no vestibular system, and thus do not require these
reflexes.

With this set of requirements, we can begin to describe the design
decisions that led to our current implementation. We begin in Section 3
with the choice of the camera system. Once we have chosen a camera
system, we can begin to design the mechanical support structures and to
select a motor system capable of fulfilling our requirements. Section 4
describes the mechanical requirements, and Section 5 gives a
description of the motor control system that we have implemented. If we
were to stop at this point, we would have a system with a standard motor
interface and a standard video output signal which could be routed to
any image processing system. Section 6 describes one possible
computational system that satisfies our design constraints, which we
have implemented with the development platforms and with Cog. In all of
these sections, we err on the side of providing too much information
with the hope that this document can serve not only as a review of this
implementation but also as a resource for other groups seeking to build
similar systems.

3 Camera System Specifications 

The camera system must have both a wide field of view and a high
resolution area. There are experimental camera systems that provide
both peripheral and foveal vision from a single camera, either with a
variable density photoreceptor array (van der Spiegel, Kreider, Claeys,
Debusschere, Sandini, Dario, Fantini, Belluti & Soncini 1989) or with
distortion lenses that magnify the central area (Kuniyoshi, Kita,
Sugimoto, Nakamura & Suehiro 1995). Because these systems are still
experimental, factors of cost, reliability, and availability preclude
using these options. A simpler alternative is to use two camera
systems, one for peripheral vision and one for foveal vision. This
alternative allows the use of standard commercial camera systems, which
are less expensive, have better reliability, and are more easily
available. Using separate foveal and peripheral systems does introduce a
registration problem; it is unclear exactly how points in the foveal
image correspond to points in the peripheral image. One solution to this
registration problem is reviewed in Section 7.4.

The vision system developed for Cog replaced an earlier vision system
which used four Elmo ME411E black and white remote-head cameras. To keep
costs low, and to provide some measure of backwards compatibility, we
elected to retain these cameras in the new design. The ME411E cameras
are 12 V, 3.2 Watt devices with cylindrical remote heads measuring
approximately 17 mm in diameter and 53 mm in length (without
connectors). The remote head weighs 25 grams, and will maintain
broadcast quality NTSC video output at distances up to 30 meters from
the main camera boards. The lower camera of each eye is fitted with a
3 mm lens that gives Cog a wide peripheral field of view (88.6° (V) x
115.8° (H)). The lens can focus from 10 mm to ∞. The upper camera is
fitted with a 15 mm lens to provide higher acuity in a smaller field of
view (18.4° (V) x 24.4° (H)). The lens focuses objects at distances
from 90 mm to ∞. This creates a foveal region significantly larger than
that of the human eye (0.3°), but significantly smaller than the
peripheral region.

For the desktop development platforms, Chinon CX-062 color cameras were
used.² These cameras were considerably less expensive than the Elmo
ME411E models, and allow us to experiment with color vision. Small
remote head cameras were chosen so that each eye is compact and
lightweight. To allow for mounting of these cameras, a 3 inch ribbon
cable connecting the remote head and the main board was replaced with a
more flexible cable. The upper cameras were fitted with 3 mm lenses to
provide a wide peripheral field of view. The lower cameras were fitted
with 11 mm lenses to provide a narrow foveal view. Both lenses can focus
from 10 mm to ∞. The CX-062 cameras are 12 V, 1.6 Watt devices with a
remote board head measuring 40 mm (V) x 36 mm (H) x 36 mm (D) and a main
camera board measuring 65 mm x 100 mm with a maximum clearance of
15 mm. The CX-062 remote heads weigh approximately 20 grams, but must be
mounted within approximately 0.5 meters of the main camera boards.

4 Mechanical Specifications 

The active vision system has three degrees of freedom (DOF) distributed
across two active “eyes”. Each eye can independently rotate about a
vertical axis (pan DOF), and the two eyes share a horizontal axis (tilt
DOF). These degrees of freedom allow for human-like eye movements.³ Cog
also has a 3 DOF neck (pan, tilt, and roll) which allows for joint pan
movements of the eyes. To allow for similar functionality, the desktop
platforms were fitted with a one degree of freedom neck, which rotates
about a vertical axis (neck pan DOF). To approximate the range of motion
of human eyes, mechanical stops were included on each eye to permit a
120° pan rotation and a 60° tilt rotation.

To minimize the inertia of each eye, we used thin, flexible cables and
chrome steel bearings.⁴ This allows the eyes to move quickly using small
motors. For Cog’s head, which uses the Elmo ME411E cameras, each fully
assembled eye (cameras, connectors, and mounts) occupies a volume of
approximately 42 mm (V) x 18 mm (H) x 88 mm (D) and weighs about 130
grams. For the development platforms, which use the Chinon CX-062
cameras, each fully assembled eye occupies a volume of approximately
70 mm (V) x 36 mm (H) x 40 mm (D) and weighs about 100 grams. Although
significantly heavier and larger than their human counterparts, these
eyes are smaller and lighter than those of other active vision systems
(Ballard 1989, Reid, Bradshaw, McLauchlan, Sharkey & Murray 1993).

² In retrospect, this choice was unfortunate because the manufacturer,
Chinon America, Inc., ceased building all small-scale cameras
approximately one year after the completion of this prototype. However,
a wide variety of commercial remote-head cameras that match these
specifications are now available.

³ Human eyes have one additional degree of freedom; they can rotate
slightly about the direction of gaze. You can observe this rotation as
you tilt your head from shoulder to shoulder. This additional degree of
freedom is not implemented in our robotic system because the pan and
tilt DOFs are sufficient to scan the visual space.

⁴ We used ABEC-1 chrome steel bearings (part # 77R16) from Alpine
Bearings.

Figure 5: Rendering of the desktop active vision system produced from
the engineering drawings of Figure 4.

Figure 6: Rendering of Cog’s active vision system. Different cameras
produce slightly different mechanical specifications, resulting in a
more compact, but heavier eye assembly.

The mechanical design and machining of the vision systems were done by
Cynthia Ferrell, Elmer Lee, and Milton Wong. Figure 4 shows three
orthographic projections of the mechanical drawings for the desktop
development platform, and Figures 5 and 6 show renderings of both the
desktop platform and the system used on Cog. The initial Cog head
prototype and the development platforms were completed in May of 1996.

5 Eye Motor System Specifications 

Section 2 outlined three requirements of the eye motor 
system. For Cog’s visual behaviors to be comparable 
to human capabilities, the motor system must be able 
to move the eyes at fast speeds, servo the eyes with fine 
position control, and smoothly move the eyes over a wide 
range of velocities. 

On average, the human eye performs 3 to 4 full range saccades per second
(Kandel et al. 1992). Given this goal, Cog’s eye motor system is
designed to perform three 120° pan saccades per second and three 60°
tilt saccades per second (with 250 ms of stability in between
saccades). This specification corresponds to angular accelerations of
1309 rad/s² and 655 rad/s² for pan and tilt, respectively.

Figure 4: Three orthographic projections of the mechanical schematics of
the desktop active vision system. All measurements are in inches.
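The text does not say how the acceleration figures were derived. One plausible reading, sketched below, assumes a triangular velocity profile (constant acceleration for the first half of the move, constant deceleration for the second half) and roughly 80 ms of travel per saccade, which is about what remains of each third of a second after 250 ms of stability. Both the profile and the 80 ms value are assumptions, and `peak_acceleration` is a hypothetical helper name.

```python
import math


def peak_acceleration(travel_deg, move_time_s):
    """Acceleration (rad/s^2) needed to cover travel_deg in move_time_s
    with a triangular velocity profile (assumed, not from the text)."""
    travel_rad = math.radians(travel_deg)
    # Half the travel is covered while accelerating for half the time:
    #   travel/2 = 0.5 * a * (t/2)^2   =>   a = 4 * travel / t^2
    return 4.0 * travel_rad / move_time_s ** 2


MOVE_TIME_S = 0.080  # assumed travel time per saccade (hypothetical)

pan_accel = peak_acceleration(120.0, MOVE_TIME_S)   # full pan range
tilt_accel = peak_acceleration(60.0, MOVE_TIME_S)   # full tilt range
print(f"pan: {pan_accel:.1f} rad/s^2, tilt: {tilt_accel:.1f} rad/s^2")
```

Under these assumptions the pan figure is twice the tilt figure, as expected when the same travel time is applied to twice the angular range.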

To meet these requirements, two types of motor were selected. For the
pan and tilt of the Cog prototype and for the neck pan and tilt on the
desktop systems, Maxon 12 Volt, 3.2 Watt motors with 19.2:1 reduction
planetary gearboxes were selected. The motor/gearbox assembly had a
total weight of 61 grams, a maximum diameter of 16 mm and a length of
approximately 60 mm. For the desktop development platforms, it was
possible to use smaller motors for the pan axis. We selected Maxon 12
Volt, 2.5 Watt motors with 16.58:1 reduction planetary gearboxes. This
motor/gearbox assembly had a total weight of 38 grams, a maximum
diameter of 13 mm and a total length of approximately 52 mm.⁵

⁵ The 3.2 Watt Maxon motor is part # RE016-039-08EAB100A and its gearbox
is part # GP016A019-0019B1A00A. The 2.5 Watt motor is part #
RE013-032-10EAB101A and its gearbox is part # GP013A020-0017B1A00A.


To provide position feedback, each motor was fitted with a
Hewlett-Packard HEDS-5500 optical shaft encoder. The HEDS-5500 has a
resolution of 1024 counts per revolution. The motor/gearbox/encoder
assembly was attached to the load through a cable transmission system.
By modifying the size of the spindles on the cable transmission, it was
possible to map one full revolution of the motor to the full range of
motion of each axis. This results in an angular resolution of 8.5
encoder ticks/degree for the pan axis and 17 encoder ticks/degree for
the tilt axis.
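The resolution figures follow directly from dividing the encoder count per revolution by the mechanical range of each axis. The short sketch below reproduces that arithmetic; the function names are hypothetical and chosen only for illustration.

```python
ENCODER_COUNTS_PER_REV = 1024  # HEDS-5500 resolution quoted in the text
PAN_RANGE_DEG = 120.0          # mechanical pan range
TILT_RANGE_DEG = 60.0          # mechanical tilt range


def ticks_per_degree(range_deg, counts_per_rev=ENCODER_COUNTS_PER_REV):
    """Encoder ticks per degree when one revolution spans the full range."""
    return counts_per_rev / range_deg


def degrees_to_ticks(angle_deg, range_deg):
    """Convert an axis angle to the nearest whole encoder tick."""
    return round(angle_deg * ticks_per_degree(range_deg))


pan_res = ticks_per_degree(PAN_RANGE_DEG)    # about 8.5 ticks/degree
tilt_res = ticks_per_degree(TILT_RANGE_DEG)  # about 17 ticks/degree
print(f"pan: {pan_res:.2f} ticks/deg, tilt: {tilt_res:.2f} ticks/deg")
```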

The motors were driven by a set of linear amplifiers, which were driven
by a commercial 4-axis motor controller (see Figure 7).⁶ This motor
controller maintained a 1.25 kHz servo loop at 16 bits of resolution for
each axis. The motor controller interfaced through the ISA bus to a PC
and provided a variety of hardware supported motion profiles including
trapezoidal profiles, S-curve acceleration and deceleration, parabolic
acceleration and deceleration, and constant velocity moves.

⁶ The linear amplifiers are model TA-100 amps from Trust Automation. The
motor controller is an LC/DSP-400 4-axis motor controller from Motion
Engineering, Inc.










































Figure 7: Schematic for the electrical wiring of the motor subsystem. The motor control signal (SIG) drives a linear 
amplifier, which produces a differential pair of amplified signals (M+ and M-). Two encoder channels (Ea and Eb) 
return feedback from the motor assembly. 


6 Computational Specifications 

To perform a variety of active vision tasks in real time, 
we desire a system that is high bandwidth, powerful, and 
scalable. The system must have enough bandwidth to 
handle four video streams at full NTSC resolution, and 
be powerful enough to process those streams. Ideally, the 
system should also be easily scalable so that additional 
processing power can be integrated as other tasks are 
required. 

6.1 Parallel Network Architecture 

Based on these criteria, we selected a parallel network architecture
based on the TIM-40 standard for the Texas Instruments TMS320C40 digital
signal processor. The TIM-40 standard allows third-party manufacturers
to produce hardware modules based around the C40 processor that
incorporate special hardware features but can still be easily interfaced
with each other. For example, one TIM-40 module might have specialized
hardware for capturing video frames while another might have special
hardware to perform convolutions quickly. Distributed computation is
feasible because modules communicate with each other through high-speed
bi-directional dedicated hardware links called comports, which were
designed to carry full size video streams or other data at 40
Mbits/second. Depending on the module, between 4 and 6 comports are
available. Additional computational power can easily be added by
attaching more TIM-40 modules to the network. Each TIM-40 module
connects to a standardized backplane that provides power and support
services. The entire network interfaces to a PC through an ISA card (in
our system, we use the Hunt Engineering HEP-C2 card).

Figure 8 shows both the general network architecture and the specific
TIM-40 modules that are currently attached to one of the development
platforms. In this network, four types of TIM-40 module are used.⁷ The
first module type is a generic C40 processor with no additional
capabilities. In this network, the two nodes labeled “ROOT” and “P2” are
both generic processors. The “ROOT” node is special only in that one of
its comports is dedicated to communications with the host computer. The
second module type, labeled “VIP”, for “Visual Information Processor”,
contains dedicated hardware to quickly compute convolutions. The third
module type, labeled “AGD”, or “Accelerated Graphics Display”, has
hardware to drive a VGA monitor. This module is very useful for
displaying processed images while debugging. The fourth module type has
hardware to grab frames from an incoming video signal. The four
instances of this module are labeled “Right Wide”, “Right Fovea”, “Left
Fovea”, and “Left Wide” in the figure. Connections between processors
are shown by single lines. Because the number of comports is limited,
the connectivity in the network is asymmetric. As we will see in the
next section, this presents only a minor problem for programming, since
virtual connectivity can be established between any two processors in
the network.

⁷ The four module types are sold by Traquair Data Systems, Inc., with
catalog numbers HET40Ex, VIPTIM, AGD, and HECCFG44 respectively.

Figure 8: General network architecture and specific connectivity of the
DSP network attached to one development platform. A Pentium Pro PC hosts
both the motor controller and a DSP interface card. The DSP network
receives video input directly and communicates motor commands back to
the controller through the DSP interface. For further explanation, see
the text.

6.2 Software Environment 

To take advantage of the high-speed interprocessor connections in the
C40 network, we use a commercial software package called Parallel C from
3L, Ltd. Parallel C is a multi-threading C library and runtime system
which essentially creates a layer of abstraction built upon the ANSI C
programming language. Parallel C consists of three main parts:

• Runtime libraries and compiler macros, which provide routines for
multi-threading and interprocessor communication, as well as standard
ANSI C functions.

• A microkernel, running on each C40 node, which handles multitasking,
communication, and transparent use of I/O throughout a network.

• A host server, running on the PC, which handles the front-end
interface to the C40 network, including downloading applications and
providing standard input and output channels.

Compiling and linking are done with the Texas Instru¬ 
ments C compiler. 

Parallel C also provides facilities for connecting tasks on processors
that do not share a physical comport connection through the use of
virtual channels. Virtual channels are one-way data streams which
transmit data from an output port to an input port in an in-order,
guaranteed way. A channel might be mapped directly to a physical comport
connection or it might travel through several nodes in the network, but
both cases can be treated identically in software. The microkernels on
each processor automatically handle virtual channels, ensuring that data
gets from one task’s output port to another task’s input port, as long
as some chain of available physical comport connections exists.
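To make the idea of a virtual channel concrete, the sketch below finds a chain of physical links between two nodes with a breadth-first search. This is only an illustration of the concept: the text does not describe how the Parallel C microkernels actually route data, the function name `route_virtual_channel` is hypothetical, and the link list is an invented topology rather than the connectivity of Figure 8.

```python
from collections import deque


def route_virtual_channel(links, src, dst):
    """Return a shortest chain of nodes from src to dst over bidirectional
    physical links, or None if no chain of links connects them."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical topology for illustration only.
LINKS = [("ROOT", "P2"), ("ROOT", "VIP"), ("P2", "Right Wide"), ("VIP", "AGD")]
path = route_virtual_channel(LINKS, "Right Wide", "AGD")
print(path)
```

From the point of view of the two tasks, a one-hop route and a multi-hop route look the same; only the intermediate nodes that forward the data differ.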

7 Example Tasks 

A number of research projects have made use of these active vision
platforms (Marjanovic, Scassellati & Williamson 1996, Scassellati 1997,
Banks & Scassellati 1997, Peskin & Scassellati 1997, Yamato 1997,
Ferrell 1997, Kemp 1997, Irie 1997). This section makes no attempt at
summarizing these diverse projects. Instead, we review a few examples to
evaluate the capabilities of the vision system. We focus on tasks that
demonstrate the hardware capabilities of the mechanical system rather
than complex visual processing. These examples are not meant to be
complete functional units, only basic tests of the vision platform.

We begin with an example of adaptive saccades, and an example of how to
use this information to saccade to salient stimuli. We also present an
example that emphasizes the rapid response of the system for smooth
pursuit tracking. The final example is a solution to the registration
problem described in Section 3. All of the data presented was collected
with the desktop development platform shown in Figure 2.

7.1 Adaptive Saccades 

Distortion effects from the wide-angle lens create a non-linear mapping
between the location of an object in the image plane and the motor
commands necessary to foveate that object. One method for compensating
for this problem would be to exactly characterize the kinematics and
optics of the vision system. However, this characterization must be
recomputed not only for every instance of the system, but also every
time a system’s kinematics or optics are modified in even the slightest
way. To obtain accurate saccades without requiring an accurate kinematic
and optic model, we use an unsupervised learning algorithm to estimate
the saccade function.

An on-line learning algorithm was implemented to incrementally update an
initial estimate of the saccade map by comparing image correlations in a
local field. The example described here uses a 17 x 17 interpolated
lookup table to estimate the saccade function. We are currently
completing a comparative study between various machine learning
techniques on this task (Banks & Scassellati 1997).
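A minimal sketch of an interpolated lookup table of this kind is given below. It assumes that the 17 x 17 nodes are spaced evenly (every 8 pixels) over a 128 x 128 image, that lookups use bilinear interpolation, and that learning nudges the nearest node by a fraction of the measured error. None of these details are stated in the text; the function names, the encoder ranges, and the learning rate are all hypothetical.

```python
GRID = 17                        # table is GRID x GRID, as in the text
IMAGE_SIZE = 128                 # image is IMAGE_SIZE x IMAGE_SIZE pixels
CELL = IMAGE_SIZE / (GRID - 1)   # assumed node spacing: 8 pixels


def linear_map(pan_range, tilt_range):
    """Initial table: (pan, tilt) encoder offsets that vary linearly with
    pixel position and are zero at the image center."""
    table = []
    for row in range(GRID):
        tilt = (row / (GRID - 1) - 0.5) * tilt_range
        table.append([((col / (GRID - 1) - 0.5) * pan_range, tilt)
                      for col in range(GRID)])
    return table


def lookup(table, x, y):
    """Bilinearly interpolate the (pan, tilt) offset for pixel (x, y)."""
    gx = min(max(x / CELL, 0.0), GRID - 1)
    gy = min(max(y / CELL, 0.0), GRID - 1)
    c0, r0 = min(int(gx), GRID - 2), min(int(gy), GRID - 2)
    fx, fy = gx - c0, gy - r0

    def blend(k):
        top = table[r0][c0][k] * (1 - fx) + table[r0][c0 + 1][k] * fx
        bottom = table[r0 + 1][c0][k] * (1 - fx) + table[r0 + 1][c0 + 1][k] * fx
        return top * (1 - fy) + bottom * fy

    return blend(0), blend(1)


def update(table, x, y, error, rate=0.5):
    """Move the node nearest to (x, y) by a fraction of the error vector
    (an assumed update rule, not the one used in the experiments)."""
    r, c = round(y / CELL), round(x / CELL)
    pan, tilt = table[r][c]
    table[r][c] = (pan + rate * error[0], tilt + rate * error[1])


table = linear_map(pan_range=1024.0, tilt_range=512.0)  # hypothetical ranges
print(lookup(table, 68, 32))
```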

Saccade map training begins with a linear estimate based on the range of
the encoder limits (determined during self-calibration). For each
learning trial, we generate a random visual target location $(x_t, y_t)$
within the 128 x 128 image array and record the normalized image
intensities $I_t$ in a 13 x 13 patch around that point. The reduced size
of the image array allows us to quickly train a general map, with the
possibility for further refinement after the coarse mapping has been
trained. Once the random target is selected, we issue a saccade motor
command using the current map estimate. After the saccade, a new image
$I_{t+1}$ is acquired. The normalized 13 x 13 center of the new image is
then correlated against the target image. Thus, for offsets $x_0$ and
$y_0$, we seek to maximize the dot product of the image vectors:

$$\max_{x_0, y_0} \sum_i \sum_j I_t(i, j) \cdot I_{t+1}(i + x_0, j + y_0) \tag{1}$$


Figure 9: Saccade Map after 0 (dashed lines) and 2000 (solid lines)
learning trials. The figure shows the pan and tilt encoder offsets
necessary to foveate every tenth position in a 128 by 128 image array
within the ranges x = [10, 110] (pan) and y = [20, 100] (tilt).

Figure 10: L2 error for saccades to image positions (x, y) after 0
training trials.


Because each image was normalized, maximizing the dot product of the
image vectors is identical to minimizing the angle between the two
vectors. This normalization also gives the algorithm a better resistance
to changes in background luminance as the camera moves. In our
experiments, we only examine offsets $x_0$ and $y_0$ in the range
[-32, 32]. The offset pair that maximized the expression in Equation 1,
scaled by a constant factor, is used as the error vector for training
the saccade map.
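The offset search can be sketched as an exhaustive scan over candidate displacements. The version below normalizes the target patch and each candidate patch to unit length before taking the dot product, which is one reasonable reading of the normalization described above but is an assumption; the helper names are hypothetical, and the images are plain lists of rows so that the sketch has no dependencies.

```python
import math


def normalize(patch):
    """Scale a patch (list of rows) to unit length when read as a vector."""
    norm = math.sqrt(sum(v * v for row in patch for v in row)) or 1.0
    return [[v / norm for v in row] for row in patch]


def crop(image, top, left, size):
    """Return the size x size patch whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def best_offset(target, image, max_offset):
    """Return the (x0, y0) within +/- max_offset that maximizes the dot
    product between the normalized target patch and the normalized patch
    of `image` displaced by (x0, y0) from the image center."""
    size = len(target)
    target = normalize(target)
    center_top = (len(image) - size) // 2
    center_left = (len(image[0]) - size) // 2
    best, best_score = (0, 0), -1.0
    for y0 in range(-max_offset, max_offset + 1):
        for x0 in range(-max_offset, max_offset + 1):
            top, left = center_top + y0, center_left + x0
            if (top < 0 or left < 0 or top + size > len(image)
                    or left + size > len(image[0])):
                continue  # candidate patch would fall outside the image
            cand = normalize(crop(image, top, left, size))
            score = sum(a * b for ra, rb in zip(target, cand)
                        for a, b in zip(ra, rb))
            if score > best_score:
                best, best_score = (x0, y0), score
    return best
```

In the setting described above, `target` would be the 13 x 13 patch recorded before the saccade, `image` the 128 x 128 image acquired after it, and `max_offset` would be 32.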

Figure 9 shows the data points in their initial linear approximation
(dashed lines) and the resulting map after 2000 learning trials (solid
lines). The saccade map after 2000 trials clearly indicates a slight
counter-clockwise rotation of the mounting of the camera, which was
verified by examination of the hardware. Figure 10 shows the L2 error
distance for saccades after 0 learning trials. After 2000 training
trials, an elapsed time of approximately 1.5 hours, training reaches an
average L2 error of less than 1 pixel (Figure 11). As a result of moving
objects during subsequent training and the imprecision of the
correlation technique, this error level remained constant regardless of
continued learning.

Figure 11: L2 error for saccades to image positions (x, y) after 2000
training trials.

7.2 Saccades to Motion Stimuli 

By combining the saccade map with visual processing techniques, simple
behaviors can be produced. To demonstrate this, we provide here a simple
example using visual motion as a saliency test. Any more complex
evaluation of saliency can easily be substituted using this simple
formulation.

A motion detection module computes the difference between consecutive
wide-angle images within a local field. A motion segmenter then uses a
region-growing technique to identify contiguous blocks of motion within
the difference image. The centroid of the largest motion block is then
used as a saccade target using the trained saccade map from Section 7.1.

The motion detection process receives a digitized 64 x 64 image from the
right wide-angle camera. Incoming images are stored in a ring of three
frame buffers; one buffer holds the current image $I_0$, one buffer
holds the previous image $I_1$, and a third buffer receives new input.
The absolute value of the difference between the grayscale values in
each image is thresholded to provide a raw motion image
($I_{raw} = T(|I_0 - I_1|)$). The difference image is then segmented
using a region-growing technique. The segmenter process scans the raw
motion image, marking all locations which pass threshold with an
identifying tag. Locations inherit tags from adjacent locations through
a region grow-and-merge procedure. Once all locations above threshold
have been tagged, the tag that has been assigned to the most locations
is declared the “winner”. The centroid of the winning tag is computed,
converted into a motor command using the saccade map, and sent to the
motors.
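The detection and segmentation steps can be sketched as follows. For brevity this version grows each region with a flood fill instead of the tag grow-and-merge procedure described above; both identify connected blocks of motion, but the substitution, the use of 4-connectivity, and the threshold value are assumptions, and the function names are hypothetical.

```python
def raw_motion(current, previous, threshold):
    """Thresholded absolute frame difference: 1 where the grayscale values
    differ by more than `threshold`, else 0."""
    return [[1 if abs(a - b) > threshold else 0 for a, b in zip(rc, rp)]
            for rc, rp in zip(current, previous)]


def largest_region_centroid(mask):
    """Return the (x, y) centroid of the largest 4-connected region of
    nonzero pixels in `mask`, or None if the mask is empty."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Flood fill from this seed to collect one connected region.
            region, stack = [], [(r, c)]
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(region) > len(best):
                best = region
    if not best:
        return None
    n = len(best)
    return (sum(x for _, x in best) / n, sum(y for y, _ in best) / n)
```

The returned centroid is the image position that would then be passed through the saccade map to obtain a motor command.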

7.3 Smooth Pursuit Tracking 

While saccades provide one set of requirements for our motor system, it
is also necessary to examine the performance of the system on smooth
pursuit tracking.⁸ Our example of smooth pursuit tracking acquires a
visual target at startup and attempts to maintain the foveation of that
target.

The central 7 x 7 patch of the initial 64 x 64 image is installed as the
target image. In this instance, we use a very small image to reduce the
computational load necessary to track non-artifact features of an
object. For each successive image, the central 44 x 44 patch is
correlated with the 7 x 7 target image. The best correlation value gives
the location of the target within the new image, and the distance from
the center of the visual field to that location gives the motion vector.
The length of the motion vector is the pixel error. The motion vector is
scaled by a constant (based on the time between iterations) and used as
a velocity command to the motors.
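One iteration of such a tracker might look like the sketch below. To keep it short, the match is scored with a sum of squared differences rather than the correlation used in the experiments, so this is a stand-in technique and not the original measure. The function name `track_step` and the `gain` parameter are hypothetical, and the function works for any image, window, and target sizes (64 x 64, 44 x 44, and 7 x 7 in the text).

```python
import math


def track_step(target, image, window, gain):
    """Locate `target` inside the central window x window region of `image`
    and return (motion_vector, pixel_error, velocity_command)."""
    size = len(target)
    height, width = len(image), len(image[0])
    top0, left0 = (height - window) // 2, (width - window) // 2
    best_score, best_pos = None, (top0, left0)
    for top in range(top0, top0 + window - size + 1):
        for left in range(left0, left0 + window - size + 1):
            # Sum of squared differences; lower means a better match.
            score = sum((image[top + i][left + j] - target[i][j]) ** 2
                        for i in range(size) for j in range(size))
            if best_score is None or score < best_score:
                best_score, best_pos = score, (top, left)
    # Offset of the matched patch center from the image center, in pixels.
    dy = best_pos[0] + size // 2 - height // 2
    dx = best_pos[1] + size // 2 - width // 2
    pixel_error = math.hypot(dx, dy)
    # The motion vector, scaled by a constant, becomes the velocity command.
    return (dx, dy), pixel_error, (gain * dx, gain * dy)
```

A one-pixel diagonal displacement gives a motion vector of (1, 1) and hence a pixel error of √2, which is the case discussed with the tracking results.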


Figure 12: Cumulative L2 pixel error accumulated while tracking a
continuously moving object. There are thirty timesteps per second.

While simple, this tracking routine performs well for smoothly moving
real-world objects. Figure 12 shows the cumulative pixel error while
tracking a mug moving continuously in circles in a cluttered background
for ten seconds. An ideal tracker would have an average pixel error of
1, since the pixel error is recorded at each timestep and it requires a
minimum of one pixel of motion before any compensation can occur. In the
experiment shown here, the average pixel error is 1.23 pixels per
timestep. (This may result from diagonal movements of the target between
consecutive timesteps; a diagonal movement results in a pixel error of
√2.) This example demonstrates that the motor system can respond quickly
enough to track smoothly.

⁸ Given saccades and smooth pursuit, vergence does not place any
additional requirements on the responsiveness of the motor system.

7.4 Registering the Foveal and Peripheral Images

Using two cameras for peripheral and foveal vision allows us to use
commercial equipment, but results in a registration problem between the
two images. We would like a registration function that describes how the
foveal image maps into the peripheral image, that is, a function that
converts positions in the foveal image into positions in the peripheral
image. Because the foveal image has a small aperture, there is little
distortion and the image linearly maps to distances in the environment.
The peripheral image is non-linear near the edges, but was determined to
be relatively linear near the center of the field of view (see Section
7.1). Because the relevant portions of both images are linear, we can
completely describe a registration function by knowing the scale and
offsets that need to be applied to the foveal image to map it directly
into the peripheral image.

One solution to this problem would be to scale the foveal image to
various sizes and then correlate the scaled images with the peripheral
image to find a corresponding position. By maximizing over the scale
factors, we could determine a suitable mapping function. This search
would be both costly and inexact. Scaling to non-integer factors would
be computationally intensive, and exactly how to perform that scaling is
questionable. Also, arbitrary scaling may cause correlation artifacts
from features that recur at multiple scales.

Another alternative is to exploit the mechanical system to obtain an
estimate of the scale function. Since both cameras share the pan axis,
by tracking the background as we move the eye at a constant velocity we
can determine an estimate of the scale between cameras. With the eye
panning at a constant velocity, separate processors for the foveal and
peripheral images track the background, keeping an estimate of the total
displacement. After moving through the entire range, we estimate the
scale between images using the following formula:

$$\mathrm{Scale}_{pan} = \frac{\mathrm{Displacement}_{peripheral}}{\mathrm{Displacement}_{foveal}} \tag{2}$$


While the tilt axis does not pass through the focal points 
of both cameras, we can still obtain a similar scaling 
factor for the tilt dimension. Because we average over 
the entire field, and do not compare directly between the 
foveal and peripheral images, a similar equation holds for 
the tilt scaling factor. Once the scaling factor is known, 
we can scale the foveal image and convolve to find the 
registration function parameters. 

We have experimentally determined the registration function parameters
for the desktop development platform using this method. Over a series of
ten experimental trials, the average scale factors for the pan and tilt
dimensions were both determined to be 4.0, with a standard deviation of
0.1. The scaled foveal image was best located at a position 2 pixels
above and 14 pixels right of the center of the 128 x 128 peripheral
image (see Figure 13). As a control, the same experiment performed on
the cameras of the other eye produced exactly the same scaling factor
(which is a product of the camera and lens choices), but different
offset positions (which are a result of camera alignment in their
respective mounts).

Figure 13: Registration of the foveal and peripheral images. The foveal
image (top) correlates to a patch in the 128 x 128 peripheral image
(bottom) that is approximately one-fourth scale and at an offset of 2
pixels above and 14 pixels right from the center.
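With the scale and offsets known, the registration function reduces to a scale-and-shift of pixel coordinates. The sketch below uses the measured values reported above (a shrink factor of 4.0 and an offset of 2 pixels up and 14 pixels right, the direction being taken from the Figure 13 caption). It assumes that the foveal image is also 128 x 128, that the offset refers to the center of the matched patch, and that image y coordinates increase downward; none of these is stated in the text, and the function name is hypothetical.

```python
PERIPHERAL_SIZE = 128   # peripheral image is 128 x 128 (from the text)
FOVEAL_SIZE = 128       # assumed foveal image size (hypothetical)
SCALE = 4.0             # measured shrink factor for both pan and tilt
OFFSET_RIGHT = 14       # patch center offset, pixels right of center
OFFSET_UP = 2           # patch center offset, pixels above center


def foveal_to_peripheral(x, y):
    """Map a foveal pixel (x to the right, y downward) into peripheral
    image coordinates using a pure scale-and-offset registration."""
    cx = PERIPHERAL_SIZE / 2 + OFFSET_RIGHT
    cy = PERIPHERAL_SIZE / 2 - OFFSET_UP   # "above" means smaller y
    return (cx + (x - FOVEAL_SIZE / 2) / SCALE,
            cy + (y - FOVEAL_SIZE / 2) / SCALE)


print(foveal_to_peripheral(64, 64))  # center of the foveal image
```

Under these assumptions the whole foveal image covers a 32 x 32 patch of the peripheral image, consistent with the one-fourth scale noted in the Figure 13 caption.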

8 Conclusions 

This report has documented the design and construction of a binocular,
foveated active vision system. The vision system combines a high acuity
central area and a wide peripheral field by using two cameras for each
eye. This technique introduces a registration problem between the camera
images, but we have shown how simple active vision techniques can
compensate for this problem. We have also presented a number of sample
visual behaviors, including adaptive saccading, saccades to salient
stimuli, and tracking, to demonstrate the capabilities of this system.